Check what the mean values are per gender
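As a sketch of how to check those group means (using a toy stand-in for `weight_data`, since the real values come from the dataset itself):

```r
# Toy stand-in for weight_data; the actual values come from the real dataset
weight_toy <- data.frame(
  gender    = c("female", "female", "male", "male"),
  weight_kg = c(40, 43.5, 48, 50)
)

# Mean weight per gender group
tapply(weight_toy$weight_kg, weight_toy$gender, mean)
```

These group means are exactly what the regression coefficients below encode: the intercept is one group's mean, and the other coefficient is the difference between groups.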
One category is chosen as the base/reference level: the first alphabetically.
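A quick way to see which level R will treat as the reference:

```r
# factor() sorts levels alphabetically by default,
# so "female" comes first and becomes the reference level
f <- factor(c("male", "female", "female"))
levels(f)
```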
mod_gender <- lm(weight_kg ~ gender, weight_data)
summary(mod_gender)
## 
## Call:
## lm(formula = weight_kg ~ gender, data = weight_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.4169  -3.6867  -0.4045   3.3929  15.0021 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.7536     0.4069  102.61   <2e-16 ***
## gendermale    7.1569     0.5909   12.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.368 on 329 degrees of freedom
## Multiple R-squared:  0.3084, Adjusted R-squared:  0.3063 
## F-statistic: 146.7 on 1 and 329 DF,  p-value: < 2.2e-16
What if we want the other to be the reference level?
gender_levels <- c("male", "female")
weight_data %>%
  mutate(gender = factor(gender, levels = gender_levels)) %>%
  lm(weight_kg ~ gender, .) %>%
  summary
## 
## Call:
## lm(formula = weight_kg ~ gender, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.4169  -3.6867  -0.4045   3.3929  15.0021 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   48.9105     0.4284  114.17   <2e-16 ***
## genderfemale  -7.1569     0.5909  -12.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.368 on 329 degrees of freedom
## Multiple R-squared:  0.3084, Adjusted R-squared:  0.3063 
## F-statistic: 146.7 on 1 and 329 DF,  p-value: < 2.2e-16
Each beta is the difference from the reference level
mod_iris <- lm(Sepal.Length ~ Species, iris)
summary(mod_iris)
## 
## Call:
## lm(formula = Sepal.Length ~ Species, data = iris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6880 -0.3285 -0.0060  0.3120  1.3120 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.0060     0.0728  68.762  < 2e-16 ***
## Speciesversicolor   0.9300     0.1030   9.033 8.77e-16 ***
## Speciesvirginica    1.5820     0.1030  15.366  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5148 on 147 degrees of freedom
## Multiple R-squared:  0.6187, Adjusted R-squared:  0.6135 
## F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16
What can we do if we want to know more than whether “versicolor” and “virginica” differ from “setosa”?
First, try the simple way: rearrange your data to highlight what you want to know
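For example, we can re-level `Species` so that "versicolor" is the reference, using base R's `relevel()`:

```r
# Copy iris and make "versicolor" the reference level
iris2 <- iris
iris2$Species <- relevel(iris2$Species, ref = "versicolor")

# Now the intercept is the versicolor mean, and the other
# betas are differences from versicolor
coef(lm(Sepal.Length ~ Species, iris2))
```

This gives us the versicolor-virginica comparison directly, which the original model (with setosa as reference) did not.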
Slightly more complicated (and in this case, unnecessary) way: use a package to compare each pair
library(emmeans)

iris_species <- emmeans(mod_iris, "Species")
pairs(iris_species)
##  contrast               estimate    SE  df t.ratio p.value
##  setosa - versicolor      -0.930 0.103 147  -9.033  <.0001
##  setosa - virginica       -1.582 0.103 147 -15.366  <.0001
##  versicolor - virginica   -0.652 0.103 147  -6.333  <.0001
## 
## P value adjustment: tukey method for comparing a family of 3 estimates
y ~ intercept + bx + error
Why not:
\(y \sim b_0 + b_1*x_1 + b_2*x_2 + ... + \epsilon\)
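In R, a formula like `y ~ x1 + x2` implicitly includes the intercept \(b_0\), and the error term is implicit in `lm()`. The design matrix makes this explicit (illustrated here with the built-in `mtcars` data):

```r
# Each column of the design matrix gets its own beta:
# an intercept column of 1s, plus one column per predictor
X <- model.matrix(mpg ~ wt + hp, mtcars)
head(X)
```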
Starting with faux data
What’s the correlation between x1 and x2 here?
library(faux)  # provides rnorm_multi()

faux_data <- rnorm_multi(n = 100,
                         mu = c(3, 2, 1),
                         sd = c(3, 3, 3),
                         r = c(0.3, 0.5, 0),
                         varnames = c("outcome", "x1", "x2"),
                         empirical = TRUE)
lm(outcome ~ x1 + x2, faux_data) %>% summary
## 
## Call:
## lm(formula = outcome ~ x1 + x2, data = faux_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4826 -1.5931 -0.0941  1.4871  6.3893 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.90000    0.30764   6.176 1.54e-08 ***
## x1           0.30000    0.08249   3.637 0.000444 ***
## x2           0.50000    0.08249   6.062 2.58e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.462 on 97 degrees of freedom
## Multiple R-squared:  0.34,  Adjusted R-squared:  0.3264 
## F-statistic: 24.98 on 2 and 97 DF,  p-value: 1.77e-09
Are the betas what you expect? The \(R^2\) = 0.34. This tells us how much of the variance in the outcome we can predict from x1 and x2 (more on this later)
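One way to see what \(R^2\) means: it is the squared correlation between the fitted and observed values. A sketch using the earlier iris model (so it runs without the faux data):

```r
# R-squared = squared correlation between fitted and observed values
mod_iris <- lm(Sepal.Length ~ Species, iris)
r2 <- cor(fitted(mod_iris), iris$Sepal.Length)^2
round(r2, 4)  # matches the Multiple R-squared in summary(mod_iris)
```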
But having two completely uncorrelated predictors is unusual.
How much of a difference does it make if they are correlated?
When do we need to worry?
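One way to explore this is to simulate two correlated predictors and look at the standard errors (a sketch in base R; the target correlation of 0.9 is an arbitrary choice for illustration):

```r
set.seed(42)
n  <- 1000
x1 <- rnorm(n)

# Build x2 so it shares most of its variance with x1 (cor close to 0.9)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)
y  <- 0.3 * x1 + 0.5 * x2 + rnorm(n)

# Compare these SEs with a model where x1 and x2 are independent:
# the stronger the correlation, the larger the SE on each beta
summary(lm(y ~ x1 + x2))$coefficients[, "Std. Error"]
```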
Go back and model the weight data with 2 predictors: height and gender.
What happens?